Introduction

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML and PDF documents, and R Markdown documents are useful for interspersing formatted text with code. There are many ways to format text in Markdown, and RStudio hosts a good cheatsheet (see section #3).

You can run the R code in a chunk either line by line or all at once (see keyboard shortcuts). To run a single line, place your cursor on that line and press ctrl-enter (cmd-enter on a Mac).

Installing and Loading Packages

Here is a chunk of R code that installs all of the packages that will be used during the workshop. You should uncomment all of the lines (remove the #), select all lines, and then hit ctrl-enter or cmd-enter. You can also install packages using the install button on the packages pane in RStudio.

# install.packages('ggmap')
# install.packages('tidyverse')
# install.packages('gapminder')
# install.packages('cowplot')
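
If you would rather not reinstall packages you already have, one option is to check first which ones are missing. This is only a sketch: the workshop_pkgs and missing_pkgs names are made up for illustration, and the install.packages() call is left commented out, as in the chunk above.

```r
# Packages used in this workshop (same list as the chunk above)
workshop_pkgs <- c("ggmap", "tidyverse", "gapminder", "cowplot")

# installed.packages() returns a matrix whose row names are package names
missing_pkgs <- setdiff(workshop_pkgs, rownames(installed.packages()))

# Show which packages still need to be installed (character(0) if none)
print(missing_pkgs)

# Uncomment to install only the missing ones
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```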

Using package functions

The ggmap package is used for plotting maps and spatial data. We will use it today to learn how to run code in R, and to play around with functions a bit.

Note: You can supply arguments to chunks to customize their behavior. This chunk has the argument cache=TRUE supplied to it. This tells the chunk to store its results (such as the variables it creates), which makes the document quicker to run if you convert it to HTML multiple times.
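
Chunk arguments are written in the chunk header, which is not shown here. As a sketch of a related approach, knitr also lets you set an option for every chunk at once with knitr::opts_chunk$set(). This assumes the knitr package is installed (it is used by R Markdown), and it is not necessarily how this document was set up.

```r
# Turn on caching for all chunks that follow (usually placed in a setup chunk)
knitr::opts_chunk$set(cache = TRUE)

# Check the current default value of the option
knitr::opts_chunk$get("cache")
```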

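library(ggmap)
## Note: recent versions of ggmap need a Google Maps API key for these
## lookups; see ?register_google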

## Plot a map of Texas. Note that it searches an online database for maps matching "texas"
qmap("texas", zoom=6, color="bw")

## Plot a map of UT now
qmap("University of Texas at Austin", zoom=15)

# You can create a variable with the "=" sign
pcl_location = geocode("101 E 21st St, Austin, TX 78712", source = "google")

## Now use a ggmap function to plot the map with the point.
## This is a function that is strung together with a "+" 
ggmap(get_map("University of Texas at Austin", zoom = 15)) + 
  geom_point(data=pcl_location, size = 7, shape = 13, color = "red")

Check out the environment pane on the top right of the RStudio screen. What do you notice? I tend to glance at that pane every once in a while to make sure variables are being created and changed as expected. For example, we created the pcl_location variable in the previous chunk, so we can check to make sure it’s there.
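
You can make the same check from the console. A minimal sketch, using made-up variable names (zoom_level and zoom_level_again) rather than anything from the chunks above:

```r
# "=" and "<-" both assign a value to a name; "<-" is the more common R style
zoom_level = 15
zoom_level_again <- 15

# Both names now refer to the same value
identical(zoom_level, zoom_level_again)

# exists() reports whether a variable has been created,
# and ls() lists the variables that the environment pane would show
exists("zoom_level")
ls()
```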

Exercise 1

Copy one of the qmap() lines of code from the previous chunk, and paste it in the next chunk. Change the number for the zoom parameter (it can only be 3-21), and change the location within the quotes. Can you get a map of Africa? How about one of your hometown?

# R Code here

Exercise 1 (extended)

Create a new variable called home that stores the geocoded location of your current or a past home address. Then plot it on a map like we did for PCL.

# R Code here

Introduction to dplyr

Now that you know how to use R Markdown and have learned about variables and functions, let’s play around with data frames using the tidyverse. Follow along with this during the presentation portion. The tidyverse is a collection of packages, created by Hadley Wickham, that provides consistent and intuitive syntax for manipulating, analyzing, and visualizing data.

Using filter, select, and %>%

The filter and select functions allow for easy subsetting of the data (selecting specific portions). filter extracts the rows that satisfy a specified expression, and select extracts columns specified by name or index (number). You can also tell select which columns you don’t want by adding a “-” in front of the column name.

We can link functions/commands together using the %>% operator. %>% takes the output of the left function or variable and puts it by default as the first argument to the right function. So df %>% head() is the same as head(df).
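
As a small sketch of the points above, the following uses the gapminder data to drop a column with a minus sign, select columns by index, and confirm that the piped and unpiped forms give the same result. It assumes the dplyr and gapminder packages are installed; the variable names are made up for illustration.

```r
library(dplyr)      # part of the tidyverse; provides filter, select, and %>%
library(gapminder)

# Drop a single column by putting "-" in front of its name
no_continent <- gapminder %>% select(-continent)
names(no_continent)

# Select columns by index (position) instead of by name
first_three <- gapminder %>% select(1:3)
names(first_three)

# The pipe passes the left-hand side as the first argument,
# so these two calls are equivalent
identical(gapminder %>% head(), head(gapminder))
```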

library(tidyverse)
library(gapminder)

# These next two lines do the same thing
gapminder %>% head()
## # A tibble: 6 × 6
##       country continent  year lifeExp      pop gdpPercap
##        <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan      Asia  1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia  1957  30.332  9240934  820.8530
## 3 Afghanistan      Asia  1962  31.997 10267083  853.1007
## 4 Afghanistan      Asia  1967  34.020 11537966  836.1971
## 5 Afghanistan      Asia  1972  36.088 13079460  739.9811
## 6 Afghanistan      Asia  1977  38.438 14880372  786.1134
head(gapminder)
## # A tibble: 6 × 6
##       country continent  year lifeExp      pop gdpPercap
##        <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan      Asia  1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia  1957  30.332  9240934  820.8530
## 3 Afghanistan      Asia  1962  31.997 10267083  853.1007
## 4 Afghanistan      Asia  1967  34.020 11537966  836.1971
## 5 Afghanistan      Asia  1972  36.088 13079460  739.9811
## 6 Afghanistan      Asia  1977  38.438 14880372  786.1134
# Example filtering rows
gapminder %>% filter(year==1952)
## # A tibble: 142 × 6
##        country continent  year lifeExp      pop  gdpPercap
##         <fctr>    <fctr> <int>   <dbl>    <int>      <dbl>
## 1  Afghanistan      Asia  1952  28.801  8425333   779.4453
## 2      Albania    Europe  1952  55.230  1282697  1601.0561
## 3      Algeria    Africa  1952  43.077  9279525  2449.0082
## 4       Angola    Africa  1952  30.015  4232095  3520.6103
## 5    Argentina  Americas  1952  62.485 17876956  5911.3151
## 6    Australia   Oceania  1952  69.120  8691212 10039.5956
## 7      Austria    Europe  1952  66.800  6927772  6137.0765
## 8      Bahrain      Asia  1952  50.939   120447  9867.0848
## 9   Bangladesh      Asia  1952  37.484 46886859   684.2442
## 10     Belgium    Europe  1952  68.000  8730405  8343.1051
## # ... with 132 more rows
# These next two lines result in the same data frame
gapminder %>% select(country:year)
## # A tibble: 1,704 × 3
##        country continent  year
##         <fctr>    <fctr> <int>
## 1  Afghanistan      Asia  1952
## 2  Afghanistan      Asia  1957
## 3  Afghanistan      Asia  1962
## 4  Afghanistan      Asia  1967
## 5  Afghanistan      Asia  1972
## 6  Afghanistan      Asia  1977
## 7  Afghanistan      Asia  1982
## 8  Afghanistan      Asia  1987
## 9  Afghanistan      Asia  1992
## 10 Afghanistan      Asia  1997
## # ... with 1,694 more rows
gapminder %>% select(country, continent, year)
## # A tibble: 1,704 × 3
##        country continent  year
##         <fctr>    <fctr> <int>
## 1  Afghanistan      Asia  1952
## 2  Afghanistan      Asia  1957
## 3  Afghanistan      Asia  1962
## 4  Afghanistan      Asia  1967
## 5  Afghanistan      Asia  1972
## 6  Afghanistan      Asia  1977
## 7  Afghanistan      Asia  1982
## 8  Afghanistan      Asia  1987
## 9  Afghanistan      Asia  1992
## 10 Afghanistan      Asia  1997
## # ... with 1,694 more rows
# We can link multiple statements together
# So if we want only the population data for year 1952 we could do this:
gapminder %>% filter(year==1952) %>%
  select(country, year, pop)
## # A tibble: 142 × 3
##        country  year      pop
##         <fctr> <int>    <int>
## 1  Afghanistan  1952  8425333
## 2      Albania  1952  1282697
## 3      Algeria  1952  9279525
## 4       Angola  1952  4232095
## 5    Argentina  1952 17876956
## 6    Australia  1952  8691212
## 7      Austria  1952  6927772
## 8      Bahrain  1952   120447
## 9   Bangladesh  1952 46886859
## 10     Belgium  1952  8730405
## # ... with 132 more rows

Exercise 2a

Try to use filter() and select() to subset your data to include only the country, year, and life expectancy data from Belgium.

# R Code here

Using mutate, group_by, and summarise

mutate is a function that adds columns to data frames. We often make new columns in data frames out of combinations of old columns, and mutate makes this fairly straightforward.

group_by and summarise are commonly used in tandem. We often want to summarise data for specific groups within the data. For example, if we had data for the heights of all people on campus, we might want to know the mean height for each gender. We would first group our data by gender, and then summarise the data with the mean.
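
The height example can be sketched with a few made-up numbers. The heights data frame below is hypothetical and exists only to show the grouping step; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical heights in cm, made up purely for illustration
heights <- data.frame(
  gender    = c("F", "F", "M", "M"),
  height_cm = c(160, 170, 175, 185)
)

# Group by gender first, then summarise each group with its mean
mean_heights <- heights %>%
  group_by(gender) %>%
  summarise(mean_height = mean(height_cm))

mean_heights
```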

# Add a gdp column using the per capita gdp and total population
gapminder %>% mutate(gdp = gdpPercap * pop) 
## # A tibble: 1,704 × 7
##        country continent  year lifeExp      pop gdpPercap         gdp
##         <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>       <dbl>
## 1  Afghanistan      Asia  1952  28.801  8425333  779.4453  6567086330
## 2  Afghanistan      Asia  1957  30.332  9240934  820.8530  7585448670
## 3  Afghanistan      Asia  1962  31.997 10267083  853.1007  8758855797
## 4  Afghanistan      Asia  1967  34.020 11537966  836.1971  9648014150
## 5  Afghanistan      Asia  1972  36.088 13079460  739.9811  9678553274
## 6  Afghanistan      Asia  1977  38.438 14880372  786.1134 11697659231
## 7  Afghanistan      Asia  1982  39.854 12881816  978.0114 12598563401
## 8  Afghanistan      Asia  1987  40.822 13867957  852.3959 11820990309
## 9  Afghanistan      Asia  1992  41.674 16317921  649.3414 10595901589
## 10 Afghanistan      Asia  1997  41.763 22227415  635.3414 14121995875
## # ... with 1,694 more rows
# Notice how the group component is added on
gapminder %>% group_by(year) 
## Source: local data frame [1,704 x 6]
## Groups: year [12]
## 
##        country continent  year lifeExp      pop gdpPercap
##         <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
## 1  Afghanistan      Asia  1952  28.801  8425333  779.4453
## 2  Afghanistan      Asia  1957  30.332  9240934  820.8530
## 3  Afghanistan      Asia  1962  31.997 10267083  853.1007
## 4  Afghanistan      Asia  1967  34.020 11537966  836.1971
## 5  Afghanistan      Asia  1972  36.088 13079460  739.9811
## 6  Afghanistan      Asia  1977  38.438 14880372  786.1134
## 7  Afghanistan      Asia  1982  39.854 12881816  978.0114
## 8  Afghanistan      Asia  1987  40.822 13867957  852.3959
## 9  Afghanistan      Asia  1992  41.674 16317921  649.3414
## 10 Afghanistan      Asia  1997  41.763 22227415  635.3414
## # ... with 1,694 more rows
# Once things are grouped, we can summarize multiple rows like this
gapminder %>% group_by(year) %>%
  summarise(mean_pc_gdp = mean(gdpPercap))
## # A tibble: 12 × 2
##     year mean_pc_gdp
##    <int>       <dbl>
## 1   1952    3725.276
## 2   1957    4299.408
## 3   1962    4725.812
## 4   1967    5483.653
## 5   1972    6770.083
## 6   1977    7313.166
## 7   1982    7518.902
## 8   1987    7900.920
## 9   1992    8158.609
## 10  1997    9090.175
## 11  2002    9917.848
## 12  2007   11680.072
# This is how we would save the resultant data frame, and we could
# do this for any of the previous chunk statements.
avg_gdp_by_year = gapminder %>% group_by(year) %>%
  summarise(mean_pc_gdp = mean(gdpPercap))

Exercise 2b

Try to use mutate(), group_by(), and summarise() to add a gdp column to gapminder, and then find the average gdp for each country.

# R Code here

Introduction to ggplot2 and tidyr

Now we’re going to discuss visualizing data using the ggplot2 package. In doing so, we will also discover why tidy data is easy to work with, and therefore will learn about data reshaping/manipulation using tidyr.

Basics of ggplot2

The ggplot2 package revolves around the ggplot function. We first specify the data, then the aesthetics, and finally the type of plot we would like to make. Further customization is possible, but we won’t have time to talk much about it. The ggplot2 documentation is a really helpful reference for understanding how to customize plots.
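
The three pieces named above can be sketched with the gapminder data. This assumes the ggplot2 and gapminder packages are installed, and the choice of columns is only for illustration.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder,                     # 1. the data
            aes(x = gdpPercap, y = lifeExp)) +    # 2. the aesthetics
  geom_point()                                    # 3. the type of plot

# Printing the object draws the plot
print(p)
```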

We will work with the pew dataset that is part of the tidyr vignette, but we need to download it first. The data contain a number of religious affiliations and, for each one, the frequency of followers falling into a variety of income brackets.

library(cowplot) # I prefer cowplot to ggplot default themes
pew = read_csv("https://raw.githubusercontent.com/hadley/tidyr/master/vignettes/pew.csv")

# Plot a scatterplot of the number of individuals in the <$10k bracket versus the >150k bracket
pew %>% ggplot(aes(x = `<$10k`, y = `>150k`)) + 
  geom_point()

# Plot bar plot for <10k bracket for all religions
# The x-axis labels overlap, but we can customize those
pew %>% ggplot(aes(x = religion, y = `<$10k`)) + 
  geom_bar(stat = "identity")

But what if we wanted to plot multiple income brackets? We would need to specify each column individually, and then somehow manually arrange them on the figure. This is where the tidyr package comes into play, which we will learn about next.

Exercise 3a

Use the gapminder dataset and try to create a boxplot (geom_boxplot()) for the life expectancy of each continent.

# R Code here

Basics of tidyr

The tidyr package contains many useful functions for cleaning and reshaping your data, but we will mainly talk about two of those (spread and gather). These functions convert long data to wide and wide data to long, respectively. Since it’s more common for data to start out in wide format, we will primarily focus on the gather function. Tidy data are defined by the following two characteristics:

  1. Every variable forms a column
  2. Each observation forms a row

What form are our pew data in?

pew %>% head()
## # A tibble: 6 × 11
##             religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k`
##                <chr>   <int>     <int>     <int>     <int>     <int>
## 1           Agnostic      27        34        60        81        76
## 2            Atheist      12        27        37        52        35
## 3           Buddhist      27        21        30        34        33
## 4           Catholic     418       617       732       670       638
## 5 Don’t know/refused      15        14        15        11        10
## 6   Evangelical Prot     575       869      1064       982       881
## # ... with 5 more variables: `$50-75k` <int>, `$75-100k` <int>,
## #   `$100-150k` <int>, `>150k` <int>, `Don't know/refused` <int>

gather

The pew data are in what we call “wide” data format, where each row corresponds to a class and the columns correspond to characteristics (observations) of that class. If we want to plot a bar for each observation, we need to convert our “wide” data to “long” format using the gather function from tidyr.

The format of gather should be as follows: gather(key = observation_name, value = data_name, columns_being_gathered), where observation_name is what you want to call the column that stores the observation type, data_name is what you want to call the column that stores the specific data values, and columns_being_gathered are the columns (using the same syntax as select) that you want gathered. spread has similar syntax and can be used to reverse gathering.
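
To make the argument names concrete, here is a sketch on a tiny made-up table (the wide_scores data frame and its values are hypothetical). It assumes dplyr and tidyr are installed. Note that in tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider; the older functions still work, and the last call shows the newer form for comparison.

```r
library(dplyr)   # for %>%
library(tidyr)

# Hypothetical wide data: one row per student, one column per test
wide_scores <- data.frame(
  student = c("ann", "bob"),
  test1   = c(90, 80),
  test2   = c(85, 95)
)

# key      = name of the new column holding the old column names
# value    = name of the new column holding the cell values
# -student = gather every column except student
long_scores <- wide_scores %>%
  gather(key = test, value = score, -student)
long_scores

# spread reverses the gathering
back_to_wide <- long_scores %>% spread(test, score)
back_to_wide

# Newer tidyr form (same values, rows ordered differently)
wide_scores %>%
  pivot_longer(cols = -student, names_to = "test", values_to = "score")
```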

# We want all columns gathered except for the religion column:
pew %>% gather(key=income, value=frequency, -religion)
## # A tibble: 180 × 3
##                   religion income frequency
##                      <chr>  <chr>     <int>
## 1                 Agnostic  <$10k        27
## 2                  Atheist  <$10k        12
## 3                 Buddhist  <$10k        27
## 4                 Catholic  <$10k       418
## 5       Don’t know/refused  <$10k        15
## 6         Evangelical Prot  <$10k       575
## 7                    Hindu  <$10k         1
## 8  Historically Black Prot  <$10k       228
## 9        Jehovah's Witness  <$10k        20
## 10                  Jewish  <$10k        19
## # ... with 170 more rows
# Alternatively:
pew %>% gather(key=income, value=frequency, 2:11)
## # A tibble: 180 × 3
##                   religion income frequency
##                      <chr>  <chr>     <int>
## 1                 Agnostic  <$10k        27
## 2                  Atheist  <$10k        12
## 3                 Buddhist  <$10k        27
## 4                 Catholic  <$10k       418
## 5       Don’t know/refused  <$10k        15
## 6         Evangelical Prot  <$10k       575
## 7                    Hindu  <$10k         1
## 8  Historically Black Prot  <$10k       228
## 9        Jehovah's Witness  <$10k        20
## 10                  Jewish  <$10k        19
## # ... with 170 more rows
# Use spread to reverse the gathering
pew %>% gather(key=income, value=frequency, 2:11) %>%
  spread(income, frequency)
## # A tibble: 18 × 11
##                   religion `<$10k` `>150k` `$10-20k` `$100-150k` `$20-30k`
## *                    <chr>   <int>   <int>     <int>       <int>     <int>
## 1                 Agnostic      27      84        34         109        60
## 2                  Atheist      12      74        27          59        37
## 3                 Buddhist      27      53        21          39        30
## 4                 Catholic     418     633       617         792       732
## 5       Don’t know/refused      15      18        14          17        15
## 6         Evangelical Prot     575     414       869         723      1064
## 7                    Hindu       1      54         9          48         7
## 8  Historically Black Prot     228      78       244          81       236
## 9        Jehovah's Witness      20       6        27          11        24
## 10                  Jewish      19     151        19          87        25
## 11           Mainline Prot     289     634       495         753       619
## 12                  Mormon      29      42        40          49        48
## 13                  Muslim       6       6         7           8         9
## 14                Orthodox      13      46        17          42        23
## 15         Other Christian       9      12         7          14        11
## 16            Other Faiths      20      41        33          40        40
## 17   Other World Religions       5       4         2           4         3
## 18            Unaffiliated     217     258       299         321       374
## # ... with 5 more variables: `$30-40k` <int>, `$40-50k` <int>,
## #   `$50-75k` <int>, `$75-100k` <int>, `Don't know/refused` <int>

Exercise 3b

Let’s say you want to exclude the respondents counted in the Don't know/refused column. Gather the income frequencies from all of the columns for people who answered the question, leaving out the Don't know/refused column.

Does the order in which you do things matter?

# R Code here

Putting it all together

We have now gone through all of the basics of data manipulation, analysis, and plotting. The cool thing about the tidyverse is that you can link as many of these pipes together as you’d like. I’d say you probably want to limit yourself to a reasonable number for readability, but we can explore the pew data much more fully now. What if we wanted to plot the bar chart we had before, but with bars for all of the income brackets, colored by bracket?

pew %>% gather(income, frequency, -religion) %>%
  ggplot(aes(x = religion, y = frequency, fill=income)) + 
  geom_bar(stat="identity") 

Customizing specific theme elements takes a bit more work, but can usually be solved through a quick google search. For example, we can fix the x axis labels this way:

pew %>% gather(income, frequency, -religion) %>%
  ggplot(aes(x = religion, y = frequency, fill=income)) + 
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=45, hjust = 1, vjust = 1))

Let’s go back to the gapminder dataset, as it is slightly more interesting for more complex examples.

# We can plot lines for the life expectancy over time for each country
gapminder %>%
  ggplot(aes(year, lifeExp, group=country)) + 
  geom_line()

# We can add an overall trendline
gapminder %>%
  ggplot(aes(year, lifeExp)) + 
  geom_point() +
  stat_smooth()

Exercise 4a

What if we wanted to plot the mean life expectancy for each year across all of the countries? Try using everything you’ve learned to make that plot in a few concise lines!

# R Code here

Exercise 4b

Now try plotting the mean life expectancy for just the continent of Africa. You should be able to do it by just adding one line to the chunk above!

# R Code here

Exercise 4c (Challenge)

If you’re up for a bit of a challenge, try to make a scatterplot of the mean life expectancy versus the mean per capita gdp for each country and year. Hint: you can summarise two things at once to create two summary columns, and you can also group_by multiple columns.

# R Code Here
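
The hint in Exercise 4c can be sketched on a different dataset so as not to give the exercise away. This uses R's built-in mtcars data (not otherwise part of this workshop) and assumes dplyr is installed; the summary column names are made up.

```r
library(dplyr)

# Group by two columns at once, then create two summary columns
mtcars_summary <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(mean_mpg = mean(mpg), mean_hp = mean(hp))

# One row per (cyl, am) combination, with both means side by side
mtcars_summary
```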